pacman::p_load(sf, tidyverse, tmap, spdep, funModeling)Take-home Exercise 1: Geospatial Analytics for Social Good
The Task
The specific tasks of this take-home exercise are as follows:
Using appropriate sf method, import the shapefile into R and save it in a simple feature data frame format. Note that there are three Projected Coordinate Systems of Nigeria, they are: EPSG: 26391, 26392, and 26303. We can use any one of them.
Using appropriate tidyr and dplyr methods, derive the proportion of functional and non-functional water point at LGA level.
Combining the geospatial and aspatial data frame into simple feature data frame.
Performing outliers/clusters analysis by using appropriate local measures of spatial association methods.
Performing hotspot areas analysis by using appropriate local measures of spatial association methods.
Thematic Mapping
- Plot maps to show the spatial distribution of functional and non-functional water point rate at Local Government Area (LGA) level by using appropriate thematic mapping technique provided by tmap package.
Analytical Mapping
- Plot hotspot areas and outliers/clusters maps of functional and non-functional water point rate at LGA level by using appropriate thematic mapping technique provided by tmap package.
Overview
Geospatial analytics hold tremendous potential to address complex problems faced by society. In this study, we are tasked to apply appropriate global and local measures of spatial association techniques to reveals the spatial patterns of non-functional water points. For the purpose of this study, Nigeria will be used as the study country.
Installing & Loading R Packages
In the code chunk below, p_load() of pacman package is used to install and load the following R packages into R environment:
sf
tidyverse
tmap
spdep
funModeling, to be used for rapid Exploratory Data Analysis
The Data
Aspatial data
For the purpose of this exercise, data from WPdx Global Data Repositories will be used. There are two versions of the data. They are: WPdx-Basic and WPdx+. We are required to use WPdx+ data set.
Geospatial data
Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data will be used in this exercise. The data can be downloaded either from The Humanitarian Data Exchange portal or geoBoundaries.
Importing Geospatial Data
Two geospatial data sets used are:
geo_exportnga_admbnda_adm2_osgof_20190417
Importing water point geospatial data
First, we are going to import the water point geospatial data (i.e. geo_export) by using the code chunk below.
(Since we have previously used this data set in the in-class exercise, we will use the data directly from there.)
wp <- st_read(dsn = "C:/Jacobche/ISSS624/In-class_Ex/rawdata",
layer = "geo_export",
crs = 4326) %>%
filter(clean_coun == "Nigeria")Things to learn from the code chunk above:
st_read() of sf package is used to import
geo_exportshapefile into R environment and save the imported geospatial data into simple feature data table.filter() of dplyr package is used to extract water point records of Nigeria only.
Note: Avoid performing transformation if you plan to use st_intersects() of sf package in the later stage of the geoprocessing. This is because st_intersects() only works correctly if the geospatial data are in geographic coordinate system (i.e wgs84).
Next, write_rds() of readr package is used to save the extracted sf data table (i.e. wp) into an output file in rds data format. The output file is called wp_nga.rds and it is saved in rawdata sub-folder, which will not be uploaded to Git.
wp_nga <- write_rds(wp,
"C:/Jacobche/ISSS624/In-class_Ex/rawdata/wp_nga.rds")Importing Nigeria LGA boundary data
Now, we are going to import the LGA boundary data into R environment by using the code chunk below.
nga <- st_read(dsn = "C:/Jacobche/ISSS624/In-class_Ex/data",
layer = "nga_admbnda_adm2_osgof_20190417",
crs = 4326)Reading layer `nga_admbnda_adm2_osgof_20190417' from data source
`C:\Jacobche\ISSS624\In-class_Ex\data' using driver `ESRI Shapefile'
Simple feature collection with 774 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS: WGS 84
Thing to learn from the code chunk above.
- st_read() of sf package is used to import
nga_admbnda_adm2_osgof_20190417shapefile into R environment and save the imported geospatial data into simple feature data table.
Data Wrangling
Recoding NA values into string
In the code chunk below, replace_na() is used to recode all the NA values in status_cle field into Unknown.
wp_nga <- read_rds("C:/Jacobche/ISSS624/In-class_Ex/rawdata/wp_nga.rds") %>%
mutate(status_cle = replace_na(status_cle, "Unknown"))Exploratory Data Analysis
In the code chunk below, freq() of funModeling package is used to display the distribution of status_cle field in wp_nga.
freq(data=wp_nga,
input = 'status_cle')
status_cle frequency percentage cumulative_perc
1 Functional 45883 48.29 48.29
2 Non-Functional 29385 30.93 79.22
3 Unknown 10656 11.22 90.44
4 Functional but needs repair 4579 4.82 95.26
5 Non-Functional due to dry season 2403 2.53 97.79
6 Functional but not in use 1686 1.77 99.56
7 Abandoned/Decommissioned 234 0.25 99.81
8 Abandoned 175 0.18 99.99
9 Non functional due to dry season 7 0.01 100.00
Extracting Water Point Data
In this section, we will extract the water point records by using classes in status_cle field.
Extracting functional water point
In the code chunk below, filter() of dplyr is used to select functional water points.
wpt_functional <- wp_nga %>%
filter(status_cle %in%
c("Functional",
"Functional but not in use",
"Functional but needs repair"))Exploratory Data Analysis (functional)
In the code chunk below, freq() of funModeling package is used to display the distribution of status_cle field in wpt_functional.
freq(data=wpt_functional,
input = 'status_cle')
status_cle frequency percentage cumulative_perc
1 Functional 45883 87.99 87.99
2 Functional but needs repair 4579 8.78 96.77
3 Functional but not in use 1686 3.23 100.00
Extracting non-functional water point
In the code chunk below, filter() of dplyr is used to select non-functional water points.
wpt_nonfunctional <- wp_nga %>%
filter(status_cle %in%
c("Abandoned/Decommissioned",
"Abandoned",
"Non-Functional",
"Non functional due to dry season",
"Non-Functional due to dry season"))Exploratory Data Analysis (non-functional)
In the code chunk below, freq() of funModeling package is used to display the distribution of status_cle field in wpt_nonfunctional.
freq(data=wpt_nonfunctional,
input = 'status_cle')
status_cle frequency percentage cumulative_perc
1 Non-Functional 29385 91.25 91.25
2 Non-Functional due to dry season 2403 7.46 98.71
3 Abandoned/Decommissioned 234 0.73 99.44
4 Abandoned 175 0.54 99.98
5 Non functional due to dry season 7 0.02 100.00
Extracting water point with Unknown class
In the code chunk below, filter() of dplyr is used to select water points with unknown status.
wpt_unknown <- wp_nga %>%
filter(status_cle == "Unknown")Performing Point-in-Polygon Count
The code chunk below performs two operations at one go. Firstly, identify water points located inside each LGA by using st_intersects(). Next, length() of Base R is used to calculate numbers of water points that fall inside each LGA.
nga_wp <- nga %>%
mutate(`total wpt` = lengths(
st_intersects(nga, wp_nga))) %>%
mutate(`wpt functional` = lengths(
st_intersects(nga, wpt_functional))) %>%
mutate(`wpt non-functional` = lengths(
st_intersects(nga, wpt_nonfunctional))) %>%
mutate(`wpt unknown` = lengths(
st_intersects(nga, wpt_unknown)))Saving the Analytical Data Table
The code chunk below computes the proportion of functional and non-functional water point at LGA level.
nga_wp <- nga_wp %>%
mutate(pct_functional = `wpt functional`/`total wpt`) %>%
mutate(`pct_non-functional` = `wpt non-functional`/`total wpt`) %>%
select(3:4, 9:10, 18:23)Things to learn from the code chunk above:
mutate() of dplyr package is used to derive two fields namely
pct_functionalandpct_non-functional.to keep the file size small, select() of dplyr is used to retain only fields 3, 4, 9, 10, 18, 19, 20, 21, 22 and 23.
Now, we have the tidy sf data table for subsequent analysis. We will save the sf data table into rds format.
write_rds(nga_wp, "C:/Jacobche/ISSS624/In-class_Ex/data/nga_wp.rds")Visualising the spatial distribution of water points
The code below uses qtm() of tmap package to plot side-by-side choropleth maps showing the spatial water points distribution by LGA levels in Nigeria.
nga_wp <- read_rds("C:/Jacobche/ISSS624/In-class_Ex/data/nga_wp.rds")
total <- qtm(nga_wp, "total wpt") +
tm_layout(scale = 0.7)
wp_functional <- qtm(nga_wp, "wpt functional")+
tm_layout(scale = 0.7)
wp_nonfunctional <- qtm(nga_wp, "wpt non-functional")+
tm_layout(scale = 0.6)
unknown <- qtm(nga_wp, "wpt unknown")+
tm_layout(scale = 0.7)
tmap_arrange(total, wp_functional, wp_nonfunctional, unknown, nrow=2, ncol=2)
Next we will create an interactive choropleth map for non-functional water points which would allow us to zoom in for a closer look.
tmap_mode("view")
tm_shape(nga_wp) +
tm_polygons("wpt non-functional",
breaks = c(0, 71, 141, 211, 280),
palette = "Reds") +
tm_layout(title= "Spatial Distribution of Non-functional Water Points") +
tm_scale_bar()tmap_mode("plot")From the map, we can see that the distribution of non-functional water points is not even with LGAs like Ifelodun and Igabi having a higher concentration than others. Nevertheless, there seem to be areas where they are clustered - i.e. around the Central and Western region of Nigeria.
Global Spatial Autocorrelation
In order to confirm our observation of signs of spatial clustering, we will make use of global autocorrection technique. We will compute the global spatial autocorrelation statistics and perform spatial complete randomness test for global spatial autocorrelation.
Computing Contiguity Spatial Weights
Before we can compute the global spatial autocorrelation statistics, we need to construct a spatial weights of the study area. The spatial weights is used to define the neighbourhood relationships between the geographical units (i.e. LGA) in the study area.
The code chunk below uses poly2nb() of spdep package to compute the Queen contiguity weight matrix for Nigeria.
wm_q <- poly2nb(nga_wp,
queen=TRUE)
set.ZeroPolicyOption(TRUE)[1] FALSE
summary(wm_q)Neighbour list object:
Number of regions: 774
Number of nonzero links: 4440
Percentage nonzero weights: 0.7411414
Average number of links: 5.736434
1 region with no links:
86
Link number distribution:
0 1 2 3 4 5 6 7 8 9 10 11 12 14
1 2 14 57 125 182 140 122 72 41 12 4 1 1
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links
The summary report above shows that there are 774 LGAs in Nigeria. The most connected LGA has 14 neighbours. There are two LGAs with only one neighbours.
Row-standardised weights matrix
Next, we need to assign weights to each neighboring polygon. In our case, each neighboring polygon will be assigned equal weight (style=“W”).
rswm_q <- nb2listw(wm_q,
style="W",
zero.policy = TRUE)
rswm_qCharacteristics of weights list object:
Neighbour list object:
Number of regions: 774
Number of nonzero links: 4440
Percentage nonzero weights: 0.7411414
Average number of links: 5.736434
1 region with no links:
86
Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 773 597529 773 285.0658 3198.414
Global Spatial Autocorrelation: Moran’s I
Maron’s I test
The code chunk below performs Moran’s I statistical testing using moran.test() of spdep.
moran.test(nga_wp$`wpt non-functional`,
listw=rswm_q,
zero.policy = TRUE,
na.action=na.omit)
Moran I test under randomisation
data: nga_wp$`wpt non-functional`
weights: rswm_q n reduced by no-neighbour observations
Moran I statistic standard deviate = 20.043, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.433932927 -0.001295337 0.000471516
From the Moran’s I test since the p-value < 2.2e-16 which is approximately 0, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Moran I statistic = 0.433932927 > 0, we can infer that there is sign of “clustered” spatial pattern.
Computing Monte Carlo Moran’s I
The code chunk below performs permutation test for Moran’s I statistic by using moran.mc() of spdep. A total of 1000 simulation will be performed.
set.seed(1234)
bperm= moran.mc(nga_wp$`wpt non-functional`,
listw=rswm_q,
nsim=999,
zero.policy = TRUE,
na.action=na.omit)
bperm
Monte-Carlo simulation of Moran I
data: nga_wp$`wpt non-functional`
weights: rswm_q
number of simulations + 1: 1000
statistic = 0.43393, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater
From the Monte Carlo simulation since the p-value = 0.001, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Moran I statistic = 0.43393 > 0, we can infer that there is sign of “clustered” spatial pattern.
Global Spatial Autocorrelation: Geary’s
Geary’s C test
The code chunk below performs Geary’s C test for spatial autocorrelation by using geary.test() of spdep.
geary.test(nga_wp$`wpt non-functional`, listw=rswm_q)
Geary C test under randomisation
data: nga_wp$`wpt non-functional`
weights: rswm_q
Geary C statistic standard deviate = 14.457, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic Expectation Variance
0.6170907765 1.0000000000 0.0007014859
From the Geary’s C test since the p-value < 2.2e-16 which is approximately 0, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Geary C statistic = 0.6170907765 < 1, we can again infer that there is sign of “clustered” spatial pattern.
Computing Monte Carlo Geary’s C
The code chunk below performs permutation test for Geary’s C statistic by using geary.mc() of spdep.
set.seed(1234)
bperm=geary.mc(nga_wp$`wpt non-functional`,
listw=rswm_q,
nsim=999)
bperm
Monte-Carlo simulation of Geary C
data: nga_wp$`wpt non-functional`
weights: rswm_q
number of simulations + 1: 1000
statistic = 0.61709, observed rank = 1, p-value = 0.001
alternative hypothesis: greater
From the Monte Carlo simulation since the p-value = 0.001, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Geary C statistic = 0.61709 < 1, we can again infer that there is sign of “clustered” spatial pattern.